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Summary. We propose a penalized orthogonal-components regression (POCRE) for large p 
small n data. Orthogonal components are sequentially constructed to maximize, upon stan- 
dardization, their correlation to the response residuals. A new penalization framework, imple- 
mented via empirical Bayes thresholding, is presented to effectively identify sparse predictors 
of each component. POCRE is computationally efficient owing to its sequential construction 
of leading sparse principal components. In addition, such construction offers other properties 
such as grouping highly correlated predictors and allowing for collinear or nearly collinear pre- 
dictors. With multivariate responses, POCRE can construct common components and thus 
build up latent-variable models for large p small n data. 

Keywords: Empirical Bayes thresholding; Latent-variable model; p > n data; POCRE; Sparse 
predictors; Supervised dimension reduction. 



1. Introduction 

Available high-throughput biotechnologies make it possible to comprehensively analyze ge- 
nomic, proteomic, or metabolomic profiles of biological samples, thus identifying molecular 
signatures to understand complex biological systems. Such profile analysis holds an enor- 
mous promise for its use in early disease detection, assessment of prognosis, measurement of 
drug efficacy, and eventually, personalized medicine. However, it usually entails collection 
of a massive amount of possible predictors (i.e., large p) from each of a small number of 
biological individuals (i.e., small n), and therefore identifying the underlying sparse predic- 
tors presents a task of "finding a very few needles in a haystack" . The structured and noisy 
pr edictors make the task even more difficult. 

Breiman fl996f ) showed that classical step-wise regression is unstable since modifying 



a single observation can change the fitted model significantl y. On the other h and, ridge 
regression is stable but it lacks the ability to select variables. iTibshirani fl996h employed 
an ^i-norm penalty and proposed the lasso method, which gained popularity due to its 
ability to select variables and, at the same time, exhibit the stability of ri dge regression. Thi s 
method has a Bayesian i nterpretation with independent Laplace priors ( Tibshirani fl996h ; 



iPark and Casella f 20081) 1 However, lasso lacks the grouping proper ty, that is, it tend s to 



select one predictor from a group of highly correlated predictors, see lZou et al. (20051 ) for 
more details. 

The grouping property plays an important role in analyzing p ^ n data with clustered 
but noisy predictors. The predictors for molecular signatures are naturally grouped due to 
sharing metabolomic pathways or biological processes, and are preferred to be included or 
excluded from the model simultaneously. On the other hand, highly correlated predictors 
can borrow strength from each other to counter the noise effect. Many lasso variants have 
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therefore been propose d to take advanta ge of the grouped predictors either impUcitly or 
expUcitly. For example, Zou et al. f2005 h proposed the elastic net (EN) which added a 



expncmy. ror example. | ZjOU et ai. izuua i proposea me eiasiic nei [rji^ ) wnicn auueu a 12- 
norm penalty; ITibshirani et al. (20051 )" proposed the fused lasso inc luding another £i-norm 



penalty to encourage similarity between coefhcients; and lYuan and Lin (20061 ) proposed 
the group lasso which modified the £i-norm penalty for grouped coefficients. 

Another strategy in analyzing p ^ n data is to first reduce the dimension of predic- 
tors by constructing components, i.e., "eigen" predictors, and then fit regression models 
by applying step-wise approaches to these components. Such construction of components 
not only provides a potential solution to the "curse of dimensionality" , but also groups 
predictors which are highly correlated or share certain common coherent patterns. Both 
unsupervised and supervised dimension reduction methods have been proposed. While 
many uns upervised methods have been proposed o n the basis of p rincipal component anal- 



many uns upervisea metnoas nave oeen proposea o n tne oasis or p rmcipai component anal- 
ysis f PCA:lHastie g/. (2000l ). lBair et aL (20061 ) . ICook (20071) ). the partial least squares 



(PLS; iGarthwaite (19941 )) regression is a supervised ap proac h and has been widely use d 
in chemometrics and bioinformatics, see Kramer (1998 ). and iNguven and Rocke 



among others. 

In this paper we propose a penalized orthogonal-components regression (POCRE) via 
a new penalization framework which can effectively identify sparse predictors from a large 
number of candidates. Section 2 presents the general idea of orthogonal-components regres- 
sion, and the penalized orthogonal-components regression is proposed in Section 3. The 
pe nalization is implemented in Sect ion 4 using the empirical Bayes thresholding proposed 
bv I Johnstone and Silverman (2004 ). Such implementation allows adaptively identifying 



sparse predictors and leads to the computationally efficient POCRE algorithm which is 
summarized in Section 5. Simulation studies and real data analysis are shown in Section 6 
and 7 respectively. We conclude this paper with a discussion. 

2. Orthogonal-Components Regression 

To illustrate the ideas behind the orthogonal-components regression, we assume 

Y^P'^X + e, (1) 

where y is a /c-dimensional column vector, X is a p-dimensional column vector independent 
of e, E[X] = 0, and P is a p x k matrix. When var{X) is non-singular and the sample size 
n is reasonably larger than p, either likelihood method or moment method can provide a 
satisfactory estimate of /?. 

Here we are interested in estimating f3 in the large p paradigm. First, var(X) may be 
singular or nearly singular due to coUinear or highly correlated predictors in X. Second, 
when p is too large, it is usually infeasible to assume that the sample size n is larger than 
p. In either case, it is difficult, if not impossible, to estimate /3 using the classical methods. 

To avoid possible problems with large p, we construct orthogonal components as lin- 
ear combinations of all predictors in X, and then regress Y on these orthogonal com- 
ponents. Such orthogonal components can be sequentially constructed. Specifically, let 
Xi — X and Yi — Y. The first component ujfXi is constructed with uj = uti maximizing 
||cou(yi, cj-^Xi)!!^ under the condition = 1. Since 

\\coviY,,u;^X,)f = \\cov{Y,uj^X)f, 

oji is the leading eigenvector of cov(Y, X)'^ cov{Y, X). Here the leading eigenvector refers to 
the one with the largest eigenvalue. When Y is univariate, i.e., k — 1, ivi oc cov{Y, X)^ . 
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After constructing the j-th component Xj , we then remove (^J Xj from Xj such that 

var{Xj)ujj 



Xj+i — Xj — 9jUjJXj is uncorrelated to uiJXj, i.e 



cov{Xj+i,ujj Xj) =0 



ujJvar{Xj)ujj 



We also remove ujJXj from Yj such that 1}+! — Yj — -djUjJXj is uncorrelated to ujJXj, i.e. 

7l\ 



cou(y,-+i , uj] Xj) =0=^ d, 



uj^ var{Xj)ujj 



Then the (j + l)-st component ujJ^iXj^i is constructed with uj = ujj+i maximizing 

\\cov{Y,u;'^X,+i)\\^ = \\coviYj+uu;^Xj+i)\\^ 

under the condition = 1. Note that Wj+i is the leading eigenvector of cov{Y, Xj+i)"^ x 
cov{Y, Xj^i). When k = 1, ujj+i equals to the normalized cov{Y, Xj^i)'^ . 
This construction stops whenever Y is uncorrelated to Xj . Since 



u;JX, = iojil - 0,_iu;f_i)l,_i = ... = coJ lYiil- 9,^iujJ_,) X, 




we denote the j-th component as vjJ X . Upon the completion of the construction, X , 
ZU2X, ■ ■ •, are uncorrelated, i.e., they constitute a sequence of orthogonal components, 
which lead to the orthogonal-components regression model. 

Theorem 1. vufX, vj\X , ■ ■ •, are orthogonal, i.e., uncorrelated. Furthermore, 

i?[r|X]=^^,(n7jx). (2) 



Compared to the original regression ([T]), the orthogonal-components regression can 
be fit by only calculating the eigenvectors of matrices but not the inverses, which makes it 
appealing in analyzing p 3> n data. Furthermore, if the predictors are highly correlated or 
even collinear, the orthogonal-components regression is still able to provide robust solution. 
The calculation is very fast due to the fact that mfX, zuJX, • • •, can be easily constructed 
and that they are uncorrelated. 



3. Penalized Orthogonal-Components Regression 

Implementing the orthogonal-components regression ^ is subject to finding the leading 
eigenvector of cov{Y, Xj)'^ cov{Y, Xj) to construct the j-th component ^JX. However, the 
involved covariances are not observed and need to be estimated from the observed data 



say the i.i.d. sample (Y„xfc,X„xp)- Wold (19751 ) estimated the covariances with their 



empirical estimates and proposed the partial least squares. Each subsequently constructed 
component is a linear combination of all available predictors. In the case oi p ^ n data, 
especially when only a small number of predictors contribute to the response variables. 
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the results from partial least squares regression inflate the errors besides the difficulty in 
interpreting the results. Here we will pursue a penalized construction for sparse loadings. 
Let 

M = mv{Y,Xj), 

be an estimate of cov{Y, Xj). A major step in implementing the orthogonal-components 
regression is to fi nd the leading sparse eigenvector of M-^M. The following theorem by 
Zou et al. r2006tl implies that finding the leading eigenvector can be taken as an optimiza- 



tion problem, which sheds light on constructing sparse eigenvectors. 



Theorem 2. \Zou et al. (200a )) For any k > 0, let 

(a, 7) = argmina,^:\\a\\=i {||M - M-ia^W^ + k||7|P} . (3) 

Then, lo = 7/II7II is the leading eigenvector of M-^M, i.e., M-^Mtj = cuj where c is the 
largest eigenvalue of M-^M. 

To ensure a sparse principal component, we consider a general version of the criterion 
i.e., with timing parameter A and penalty function pa(7), 

(a(K),7(K)) = argmin^,^.,\\^\\^^ {||M - M7a^||2 -|- ^hlP + Pa(7)} • (4) 

Here the penalty is introduced to benefit estimating covariances and thresholding 7 such 
that most of the elements in 7 are zero, i.e., 7 is sparse. While Theorem 2 implies that 
specific value of n does not affect the solution to optimization problem ([2]), the following 
theorem states that sparse 7 can be derived from a problem without specifying k in ([¥]). 

Theorem 3. Suppose p\(c^) — cp\(j) for any scaler c > 0. Let (d(K),7(K)) be the solution 
to (|3]). And (0,7) is the solution to the following problem 

(q,7) = argmm„^^,||„||=i {-27^M^Ma + hf+px^)} ■ (5) 

Then, 7(k)/||7(k)|| approaches to 7/II7II when k 00. 

We will iteratively solve ([5]) for a and 7. First, for a given 7, we have 

d(7) = argmm^. ii^ii^i {-27^M^Ma} = M^M7/||M^M7|| . 

Second, for a given a, we have 

7(a) = argmiriy { II7 — M'^Ma||^ -1- Pa (7)} , (6) 

which will be approximated using the empirical Bayes thresholding as discussed in the 
following section. 

4. Penalization via Empirical Bayes Thresholding 

Denote Z — M-^Ma. Then solving for 7(a) in ([6]) is subject to minimizing ||Z — 7||^+pa(7) 
with respect to 7. Suppose the i-th component of Z is Zi, and further assume. 



Zi = fJ-i + ei, ei~iV(0, CT^). 



Penalized Orthogonal-Components Regression 5 

Since p is large and most of {/i^, 1 < i < p} are zero, the variance can be estimated by 

<T = media7ii<^<p {\z,\} /^-^{0.75). (7) 

Note tha t this estimat e partiaUy accounts for under- or over-dispersion due to dependent 
data, see Efron When implementing the penalization of POCRE, we also introduce 



a tuning parameter A to account for the possible over-dispersion when standardizing Zi using 
Act. Without loss of generality, hereafter we assume e,; *~ N{0, 1). 

When px{-) is specified by the logarithm of a prior density function, the optimal 7 is 
indeed a Bayesian estimate of (/ii, • • • In consideration o f the sparsity of 7, we em- 



ploy the empirical Bayes thresholding (EBT) proposed bv , Johnstone and Silverman (2004 . 



20051) for a better approximation to the leading sparse eigenvalue of M^M 



Specifically, we assume a mixture prior with a point mass at zero and a quasi-Cauchy 
distribution for each /Zi, i.e., 

t:(P) = (1 - w)do[fi) + ■w—= < 1 



/2^ I HiJ-t) 

where 6o{-) is Dirac's delta function. Since the marginal distribution of Zi is 

an estimate of w, say ii, can be calculated by maximizing the marginal likelihood. Then fii 
can be estimated by the posterior median, i.e., 

fii — jji{zi) — median{^i\zi, w) . 

As w provides a data-driven estimate of the param eter sparsity, the resultant estima te is 



adaptive to the sparsity of the underlying parameter. iJohnstone and Silverman (2004 ) also 
showed that the empirical Bayes estimator fi(z) is a thresholding estimator in the sense that 
(i) fi{z) is increasing on z £ R; (ii) \fl{z)\ < \z\, Vz G R; (iii) fi{—z) = —ft{z); (iv) there 
exists r > such that /i(z) = if and only if \z\ < t. 

As noted above, although p,i is constructed by assuming all components of Z are inde- 
pendent, using the estimate ct in ([7]) and the tuning parameter A in the penalty function 
p\{-) account for possible dependence. In practice, ten-fold cross-validation can be employed 
to elicit the optimal value of A ranging from 0.6 to 1. As demonstrated by our simulation 
studies, it usually suffices to consider A G {0.8, 0.81, 0.82, • • • , 1}. 



5. The Algorithm 

Without loss of generality, we further assume that both X and Y are centered. Therefore, 
an estimate of cov{Y,X) is M cx Y^X. Suppose wi, • • • have been calculated, and 

Xj has been updated accordingly. An estimate of cov{Y,Xj) is proportional to Y-^Xj. We 
can therefore proceed to find coj as follows, 

1. Initialize 7 to be the leading eigenvector of XjYY-'^Xj; 

2. Update a = XjYY^X,7/||XjYY^Xj7|| ; 

3. Calculate a = median {\X.jYY^Xja\} /<i>-^{0.75); 
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^ TT J ^ - ( XfYY^XjoN . . 

4. Update 7 = ^1 — xs I 

5. Repeat 2-4 until convergence, then ujj = 7/II7II; 

6. Calculate rjj — 'X.jUjj; 

7. Calculate Pj — ijJ'X.j/rjJrjj, and update Xj+i = Xj — rjjPj. 

Note that the first five steps are used to calculate the first principal component of 
XjYY-^Xj, which is adaptive to the sparsity of the non-zero loadings. A mong these step s, 
the first step may be easily implemented using the following power method (Stewart (1974))), 
which has been used for the nonlinear iterative partial least squares (NIPALS: iWold (1975( )). 

l.a. Initialize ip to be the first column of Y^-; 

l.b. 7 = XJ^/|!XJ^||; 

I.e. 1] = Xj7; 

l.d. ^ = YVI|Y77||; 

I.e. ip — Yep; 

l.f. Repeat l.b - l.e until the convergence of 7. 

When LUj converges to the leading eigenvector of XjYY-^Xj, then rjj is an eigenvector 
of XjXjYY"^, which defines the j-th orthogonal component. Note that Pj in Step 7 helps 
calculate X^+i due to the fact that 77jXj_|_i = 0. 

Since 

Xj + l = Xj - TJjPj = Xj(/ - UJjPj), 

when writing Xj+i = XCj+i, Cj+i can be sequentially calculated as follows, 

Ci = ipxp-, 0+1 = Cj{i - ^jPj), j = 1, 2, • • • . 

Suppose that the above algorithm stops at (I + l)-st step, i.e., uji+i = 0. Then we regress 
Y on the orthogonal components ijj, j — 1,2, ■ ■ ■ ,1, and fit the following model, 

; 

which implies that Qj = r/jY / r/J rjj . Since rjj = XQujj, the estimate (3 oi (3 in ^ can then 
be derived as 

I 

6. Simulation Studies 

We consider five different cases of large p small n data to evaluate the performance of 
POCRE and compare with other approaches such as partial least squares (PLS), ridge 
regression, lasso, and elastic net (EN). The first two cases have highly and mildly correlated 
predictors respectively, the third one has clustered predictors, the fourth one demonstrates 
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a measurement-error model, and the fifth one features a latent-variable model. In all cases, 
we fix p = 1000 and consider both n = 50 and n — 100. 

Case 1 (High Correlations). Y = 2^j° ^ Xj + Y.]l°ioi + where e ~ 7V(0, 1), 
and each block {^fc-i-ii ■ ■ ' j -''^fc+ioo} is simulated from an AR(1) process with p — 0.9, 
fc = 0,100, •••,900. 

Case 2 (Mild Correlations). Same as Case 1 except that p — 0.5. 

Case 3 (Clustered Predictors). Y ^ l-^Y.%iX] + where e - 7V(0,152), and 
Xj = Zil{j<io} + ^2l{ii<j<20} + 2'3l{2i<j<30} + Ci- Here Zi,Z2,Z3 A^(0, 1), and 
£,j - A^(0,0.01). 

Case 4 (Errors in Predictors). F = Zi + 2^2 + + e, where e iV(0, 1). Note 
that Xj = si5n(5.5-j)Zil{j<io} +5*5^(15.5- j)Z2l{ii<j<20} + ^3l{2i<i<30} where 
Zi, Z2, Z3 - iV(0, 1), and A^(0, 1). 

Case 5 (Latent- Variable Model). Yk = akZi + bkZ2 + Efc, 1 < fc < 5, where 
oi = 02 = 62 = 2, 61 = 03 = &3 -2, 04 = 05 = 3, 64 = -65 = 1.5, and Sk N{0, 1). 

Zl — ^50 + -^150 + -'^250 + -'^350 + -^450+-'^550 and Z2 = ^51+-'^153 + -'^256+-^359+-^467+-^583, 

where X's are the same as in Case 1 except that p = 0.3. 

Here we evaluate the algorithms on the basis of two different criteria, i.e., the loss defined 
as — y|p|/3] — tr{var(Y\X)}, and the false discovery rate (FDR). In each case, we 

simulated 100 datasets, and therefore calculated the values of the loss and FDR on the 
basis of the estimated parameters. Ten-fold cross-validations are used to find the optimal 
tuning parameters for EN, lasso, POCRE, and ridge regression, and the optimal number of 
components for PLS. 

Since neither PLS nor ridge regression selects variables and both instead build up the 
model using all available predictors, FDR is not reported for either method. In all cases, 
both methods report very large losses compared to the other three methods due to infiated 
prediction errors by using all predictors. It is interesting to note that both PLS and ridge 
regression perform similarly in terms of losses, although PLS is able to build common 
components for multivariate responses. 

In Case 1 with highly correlated predictors, both lasso and POCRE present much smaller 
losses than EN, as shown in Table [TJ When the correlations between predictors are mild 
as in Case 2, the losses of both EN and POCRE dramatically decrease but the loss of 
lasso increases when n — 100. For n = 50, all three methods increase the losses with lasso 
increases the most. In both cases, lasso presents the smallest losses. However, POCRE is 
able to build up common components shared by multiple responses and lowers the losses, as 
shown in Case 5. Indeed, POCRE has much smaller loss than other methods for n — 100, 
and is comparable to lasso for n — 50. 

In Case 3 with clustered predictors, POCRE performs extremely well when compared to 
all other methods. In Case 4 with errors in predictors, POCRE also presents the smallest 
losses. Indeed, in Case 3, POCRE decreases 55.82% and 79.30% of the losses when compared 
to the best of all other methods for n = 50 and n = 100, respectively. And in Case 4, 
POCRE decreases 22.61% and 40.00% for n = 50 and n = 100, respectively. Therefore, 
POCRE prevails in handling clustered or noisy predictors due to its building up components 
through maximizing their correlations to the response variables. 
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Table 1 . Summary on losses (with standard errors in parentlieses) 



n 


Method 


Case 1 


Case 2 


Case 3 


Case 4 


Case 5 




EN 


29.80(1.31) 


2.03(1.53) 


103.34(4.35) 


1.45(0.04) 


13.48(1.29) 




Lasso 


0.66(0.02) 


1.76(0.10) 


72.12(4.04) 


1.59(0.03) 


12.47(0.79) 


100 


PLS 


81.44(1.15) 


89.94(0.48) 


187.57(3.25) 


3.10(0.02) 


254.43(0.79) 




POCRE 


6.13(0.53) 


3.58(0.42) 


14.93(2.81) 


0.87(0.03) 


4.74(1.99) 




Ridge 


81.60(1.13) 


89.71(0.44) 


193.90(3.21) 


3.09(0.02) 


253.18(0.52) 




EN 


39.23(2.09) 


52.45(2.65) 


141.90(7.93) 


2.30(0.13) 


250.51(2.92) 




Lasso 


1.98(0.13) 


33.24(1.66) 


167.93(9.64) 


2.74(0.06) 


234.97(3.21) 


50 


PLS 


196.82(2.25) 


111.26(0.73) 


331.31(4.35) 


4.24(0.03) 


273.23(0.83) 




POCRE 


9.10(2.00) 


40.88(2.05) 


62.69(5.78) 


1.78(0.06) 


236.53(5.17) 




Ridge 


192.01(2.26) 


110.56(0.53) 


333.79(4.45) 


4.22(0.03) 


269.71(0.62) 



Table 2. Summary on FDR 



n 


Method 


Case 1 


Case 2 


Case 3 


Case 4 


Case 5 




EN 


0.9603 


0.7260 


0.4118 


0.7216 


0.8452 


100 


Lasso 


0.5745 


0.7037 


0.7931 


0.6087 


0.8421 




POCRE 


0.5745 


0.1304 


0.0909 


0.1724 


0.2500 




EN 


0.9184 


0.8365 


0.7285 


0.8167 


0.9622 


50 


Lasso 


0.4722 


0.6818 


0.8222 


0.6333 


0.8197 




POCRE 


0.3103 


0.5102 


0.1892 


0.4194 


0.7742 



In all cases, POCRE performs the best in terms of FDR, as shown in Table [D With 
n = 100, POCRE can control the FDR under 25% for all cases except Case 1 in which the 
FDR is at 57.45% as POCRE tends to include predictors which are highly correlated to 
those true predictors. On the other hand, lasso presents FDR as high as 84.21%, with the 
lowest level at 57.45%. Not surprisingly, EN performs better than lasso in Case 3, i.e., with 
the lowest FDR at 41.18%, as it can account for group effects of predictors. However, it 
presents higher FDRs than lasso for all other cases. With n = 50, although POCRE still 
presents lower FDRs than other two methods, all methods present high FDRs except that 
POCRE has the FDR at 18.92% in Case 3. 

7. A Real Data Analysis 

Lan et al. (20061) designed an experiment to identify the genetic basis for differences be- 
tween two inbred mouse populations (B6 and BTBR). A total of 60 arrays were used to 
monitor the expression levels of 22,690 genes of 31 female and 29 male mice. Some phys- 
iological phenotypes, including numbers of stearoyl-CoA desaturase 1 (SCDl), glycerol-3- 
phosphate acyltransferase (GPAT) and phosphoenopyruvate carboxykinase (PEPCK), were 
also measured by quantitative real-time RT-PCR. The gene expression data and the phe- 
notypic data are available in GEO (http: / /www.ncbi. nlm.nih.gov/geo, accession number 
GSE3330). 

We adjusted the phenotypic values to remove the possible gender effects. For each 
phenotype, its correlation to each gene is calculated, then an overall correlation coefficient 
(OCC) of the three phenotypes to a single gene is defined as minimizing the absolute values 
of the correlation coefficients between the gene and three phenotypes. Here we investigated 
expression profiling of the top 5,000 genes (ranked on the basis of OCC) to predict the three 
physiological phenotypic values. We set up the test dataset including randomly selected 5 
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Table 3. Summary on Real Data Analysis 





Sum of Squared Prediction Error 


Number of Selected Genes 


Method 


SCDl 


GPAT PEPCK Total 


SCDl 


GPAT 


PEPCK 


EN 


3.96 


22.07 2.59 28.62 


255 


34 


5000 


Lasso 


6.38 


22.07 2.87 31.32 


1 


34 


8 


POCRE 


3.15 


16.11 1.93 21.19 


195 


106 


58 



(a) uji (b) UJ2 (c) u)3 

Fig. 1. Ljj, j = 1, 2, 3 generated by POCRE for SCD1 . 



female and 5 male mice, and the rest are included in the training dataset. We built up the 
model using the training dataset and then calculated the sum of squared prediction errors 
(SSPE) using the test data. 

With each of EN, lasso, and POCRE, we separately build up regression models for each 
of the three physiological phenotypic values. The results are presented in Tabled Overall, 
lasso tends to select small number of predictors, and also reports the largest SSPE. On 
the other hand, POCRE reports the smallest SSPE for each phenotype, and selects smaller 
number of predictors for both SCDl and PEPCK, but larger number of predictors for GPAT 
than EN. POCRE generates three components for SCDl (see Figure[T]), and one component 
for each of the other two phenotypes (results not shown). 

We also fit a multivariate-response regression model for the three phenotypes using 
POCRE. Four common components are generated using a total of 277 genes. The resultant 
model reports SSPE for a total of 22.85 (i.e., 2.79, 18.14, and 1.93 for SCDl, GPAT, and 
PEPCK, respectively). The two regression models built by POCRE share only 21 genes for 
SCDl, 59 genes for GPAT, and 36 genes for PEPCK, although they report similar SSPE 
values. 



8. Discussion 

Effective dimension reduction is crucial for a successful analysis of p ^ n data. Traditional 
unsupervised dimension reduction can be used to exclude many features from constructed 
sparse predictors, but the false discovery rate (FDR) can be very high. On the other 
hand, available supervised dimension reduction, such as PLS, ignores the sparse nature of 
the underlying signatures. Furthermore, all these methods assume that the predictors are 
accurately measured, and do not incorporate functional relatedness of candidates. As a 
result, despite years of searching, only a handful of predictive biomarkers have advanced to 
general clinical practice. Clearly, more effective approaches are called if the true potential 
of predictive molecular signatures is to be realized. 
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POCRE builds up orthogonal components by aggregating contribution of predictors 
along the direction which maximizes their correlations to the response variables or residuals 
(when predictors are standardized). It sequentially constructs these orthogonal components 
by finding penalized leading principal components. The involved computation is efficient 
and feasible for large p small n data. As in Section 7 which presented a training dataset 
with n = 50 and j9 = 22, 690, POCRE, coded in MATLAB® , took less than two minutes to 
fit the regression model with four components (the tuning parameter was set at A = 0.75, 
and it was run on a desktop computer with Intel® 3.0GHz Core"'"'^ 2 Duo CPU). 

POCRE implements the penalization via an empirical Bayes thresholding. Since this 
empirical Bayes thresholding is constructed with a sparsity- adaptive prior, POCRE is au- 
tomatically enabled to select sparse variables in the large p small n paradigm. As shown 
in the simulation studies, it provides a clear and significant benefit to the general task of 
variable selection in the large p small n paradigm, even with clustered predictors or noisy 
predictors. It confirmed the utility of the new method in molecular profiling, thus indicat- 
ing an enormous promise for its use in transcriptional profiling (genomics), protein profiling 
(proteomics), methylation profiling (epigenomics) , and metabolite profiling (mctabolomics). 
The full potential of the new framework, however, lies in providing breakthrough solutions 
to implementing the Bayesian penalization for structured noisy features. 
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Appendix A: Proof of theorem 1 

Since for each j, cov{Xj+i,u)JXj) = 0, then for any Z > 0, 



i-i 



,wj+i_„) \ cov{Xj+i,(jJXj) = 0, 



m=l 



which proves that mfX, wJX, ■ ■ 
On the other hand. 



are uncor related and therefore orthogonal. 



Yi+i =Yi- diwJX = ■■■ = 



Suppose yj+i is uncorrelated to Then, 



E[Y\X] = ^j^JX + E[Yi+^ \X] 



Note that 



^i+i = ^i- (^i^J^i = ■■■ = 
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Therefore, 

I 

cov{Yi+uX) = cov{Yi+i,Xi+i) +J2cov{Yi+i,u;JXj)eJ = 0. 

Denote Yi+i = P'^X + e, then 

= =^ f3^VP = cov(f3^X, 13^ X) = =^ X = 0, 
which imphes that i5[Y;+i|X] = 0, and concludes the proof. 



Appendix B: Proof of theorem 3 

Denote 



(d(K),7(K)) = argmina,^:\\a\\=i 



Then 7(«)/ll7(«)ll=7(«)/ll7(«)ll- 
Since 



M - M^— 

1 + K 



1 + K 



1 + K 



M - M—^a^ 

1 + K 



+ K 



7 



1 + K 



7 



1 + K 



tr(M^M) + I + Y^tr (a^'^M^Mja^ ) + y:^7^7 + Pa (7) 



= tr{M^M) + <^ -27^M^Ma + 7^ 



1 + K 

Therefore, 

(q!(k),7(k)) = argmina,^.,iia\\=i |- 



1 + K 



7 + Pa (7) 



2Y Ma + Y — 7 + px (7) 

i "P K 



which imphes 

(q;(oo),7(oo)) =ars(mm„,7:||a||=i{-27^M^Ma+||7f +Pa(7)} = (a,7)- 
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